log_loss (cross-entropy / negative log-likelihood)#
log_loss measures how well predicted probabilities match the true labels.
It is the standard objective for logistic regression / softmax classifiers, and a common evaluation metric for probabilistic models.
Learning goals#
understand the binary and multiclass formulas (with notation)
build intuition for why confident mistakes are punished heavily
implement numerically stable log loss in NumPy (from probabilities and from logits)
see how minimizing log loss trains logistic regression via gradient descent
know when log loss is the right metric (and when it is not)
Quick import#
from sklearn.metrics import log_loss
Table of contents#
Definitions and notation
Intuition (plots)
NumPy implementation (binary + multiclass)
Using log loss to optimize logistic regression
Pros, cons, pitfalls
import numpy as np
import plotly.graph_objects as go
import os
import plotly.io as pio
from plotly.subplots import make_subplots
from sklearn.datasets import make_blobs
from sklearn.metrics import log_loss as sk_log_loss
from sklearn.model_selection import train_test_split
pio.templates.default = 'plotly_white'
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)
rng = np.random.default_rng(0)
1) Definitions and notation#
Assume we have \(n\) examples.
Binary classification#
True label: \(y_i \in \{0,1\}\)
Predicted probability of the positive class: \(p_i = P(y_i=1 \mid x_i)\)
Per-example log loss (Bernoulli negative log-likelihood) is:
\[\ell_i = -\bigl[y_i \log p_i + (1-y_i)\log(1-p_i)\bigr]\]
Average (optionally weighted) log loss:
\[\mathcal{L} = \frac{\sum_{i=1}^{n} w_i\,\ell_i}{\sum_{i=1}^{n} w_i}, \qquad w_i = 1 \text{ for the plain mean}\]
Multiclass classification (\(K\) classes)#
True label: \(y_i \in \{0,1,\dots,K-1\}\)
Predicted probabilities: \(p_{ik} = P(y_i=k \mid x_i)\) with \(\sum_{k=0}^{K-1} p_{ik}=1\)
If we write one-hot targets \(y_{ik} \in \{0,1\}\), then:
\[\mathcal{L} = -\frac{1}{n}\sum_{i=1}^{n}\sum_{k=0}^{K-1} y_{ik}\log p_{ik}\]
Equivalently (using integer labels):
\[\mathcal{L} = -\frac{1}{n}\sum_{i=1}^{n}\log p_{i,y_i}\]
Why the log?#
If a model assigns probability \(p_{i,y_i}\) to the true class, the likelihood of the dataset is:
\[\prod_{i=1}^{n} p_{i,y_i}\]
Taking the negative log turns the product into a sum:
\[-\log\prod_{i=1}^{n} p_{i,y_i} = -\sum_{i=1}^{n}\log p_{i,y_i} = n\,\mathcal{L}\]
So minimizing log loss is the same as maximizing likelihood.
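A quick numeric check of this equivalence, on made-up per-example probabilities:

```python
import numpy as np

# Hypothetical probabilities a model assigns to the true class of each example
p_true_class = np.array([0.9, 0.8, 0.6, 0.7])

likelihood = np.prod(p_true_class)              # product over the dataset
mean_log_loss = -np.mean(np.log(p_true_class))  # average log loss

# exp(-n * mean log loss) recovers the likelihood, so driving the mean
# log loss down drives the likelihood up.
print(likelihood)
print(np.exp(-len(p_true_class) * mean_log_loss))
```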
From logits (numerical stability)#
Sometimes models output logits (real-valued scores) instead of probabilities.
Binary: logit \(z_i = w^\top x_i + b\), \(p_i = \sigma(z_i)\) where \(\sigma\) is the sigmoid.
A stable per-sample loss is:
\[\ell_i = \operatorname{softplus}(z_i) - y_i z_i, \qquad \operatorname{softplus}(z) = \log\bigl(1 + e^{z}\bigr)\]
Multiclass: logits \(z_{ik}\), softmax probabilities \(p_{ik} = \frac{e^{z_{ik}}}{\sum_j e^{z_{ij}}}\).
Stable loss:
\[\ell_i = \log\sum_{j=0}^{K-1} e^{z_{ij}} - z_{i,y_i} = \operatorname{logsumexp}_j(z_{ij}) - z_{i,y_i}\]
In practice we also clip probabilities with a small \(\varepsilon\) to avoid \(\log(0)\) (which would be \(+\infty\)).
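A minimal sketch of what clipping buys, on a single made-up example where the model assigns zero probability to the true class:

```python
import numpy as np

y = np.array([1.0])
p = np.array([0.0])  # zero probability on the true class

# Without clipping: log(0) gives -inf, so the loss is +inf
naive = -(y * np.log(p) + (1 - y) * np.log(1 - p))

# With clipping: the loss is large but finite (-log(1e-15) is about 34.5)
eps = 1e-15
p_clipped = np.clip(p, eps, 1 - eps)
clipped = -(y * np.log(p_clipped) + (1 - y) * np.log(1 - p_clipped))

print(naive)    # [inf]
print(clipped)  # finite
```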
def sigmoid(z):
z = np.asarray(z, dtype=float)
return np.where(z >= 0, 1.0 / (1.0 + np.exp(-z)), np.exp(z) / (1.0 + np.exp(z)))
def softplus(z):
z = np.asarray(z, dtype=float)
return np.logaddexp(0.0, z)
def logsumexp(a, axis=None, keepdims=False):
a = np.asarray(a, dtype=float)
a_max = np.max(a, axis=axis, keepdims=True)
out = np.log(np.sum(np.exp(a - a_max), axis=axis, keepdims=True)) + a_max
if keepdims:
return out
if axis is None:
return out.squeeze()
return np.squeeze(out, axis=axis)
def log_softmax(z, axis=1):
z = np.asarray(z, dtype=float)
return z - logsumexp(z, axis=axis, keepdims=True)
def _weighted_mean(values, sample_weight=None):
values = np.asarray(values, dtype=float)
if sample_weight is None:
return float(np.mean(values))
w = np.asarray(sample_weight, dtype=float)
if w.shape != values.shape:
raise ValueError('sample_weight must have the same shape as values')
w_sum = np.sum(w)
if w_sum <= 0:
raise ValueError('sample_weight must sum to a positive number')
return float(np.sum(w * values) / w_sum)
def log_loss_binary(y_true, y_prob, *, eps=1e-15, sample_weight=None):
"""Binary log loss from probabilities.
Parameters
- y_true: shape (n,), values in {0,1}
- y_prob: shape (n,), predicted P(y=1|x)
"""
y_true = np.asarray(y_true)
y_prob = np.asarray(y_prob, dtype=float)
if y_true.shape != y_prob.shape:
raise ValueError('y_true and y_prob must have the same shape')
if np.any((y_true != 0) & (y_true != 1)):
raise ValueError('y_true must contain only 0/1 labels')
p = np.clip(y_prob, eps, 1.0 - eps)
losses = -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
return _weighted_mean(losses, sample_weight=sample_weight)
def log_loss_multiclass(y_true, y_prob, *, eps=1e-15, sample_weight=None):
"""Multiclass log loss from probabilities.
Parameters
- y_true: shape (n,), integer labels in {0,1,...,K-1}
- y_prob: shape (n,K), predicted class probabilities (rows should sum to 1)
"""
y_true = np.asarray(y_true)
y_prob = np.asarray(y_prob, dtype=float)
if y_prob.ndim != 2:
raise ValueError('y_prob must be a 2D array of shape (n_samples, n_classes)')
n_samples, n_classes = y_prob.shape
if y_true.shape != (n_samples,):
raise ValueError('y_true must have shape (n_samples,)')
if np.any((y_true < 0) | (y_true >= n_classes)):
raise ValueError('y_true contains labels outside [0, n_classes)')
p = np.clip(y_prob, eps, 1.0 - eps)
p = p / p.sum(axis=1, keepdims=True)
losses = -np.log(p[np.arange(n_samples), y_true])
return _weighted_mean(losses, sample_weight=sample_weight)
def log_loss_binary_from_logits(y_true, logits, *, sample_weight=None):
"""Binary log loss from logits: softplus(z) - y*z."""
y_true = np.asarray(y_true)
logits = np.asarray(logits, dtype=float)
if y_true.shape != logits.shape:
raise ValueError('y_true and logits must have the same shape')
if np.any((y_true != 0) & (y_true != 1)):
raise ValueError('y_true must contain only 0/1 labels')
losses = softplus(logits) - y_true * logits
return _weighted_mean(losses, sample_weight=sample_weight)
def log_loss_multiclass_from_logits(y_true, logits, *, sample_weight=None):
"""Multiclass log loss from logits: -log softmax(true_class)."""
y_true = np.asarray(y_true)
logits = np.asarray(logits, dtype=float)
if logits.ndim != 2:
raise ValueError('logits must be a 2D array of shape (n_samples, n_classes)')
n_samples, n_classes = logits.shape
if y_true.shape != (n_samples,):
raise ValueError('y_true must have shape (n_samples,)')
if np.any((y_true < 0) | (y_true >= n_classes)):
raise ValueError('y_true contains labels outside [0, n_classes)')
log_probs = log_softmax(logits, axis=1)
losses = -log_probs[np.arange(n_samples), y_true]
return _weighted_mean(losses, sample_weight=sample_weight)
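To see why the logit-based formulas matter, here is a self-contained comparison on an extreme (made-up) logit of 40, where the naive probability route fails but softplus(z) - y*z does not:

```python
import numpy as np

z = np.array([40.0, -40.0])  # extreme logits
y = np.array([0.0, 1.0])     # both predictions are confidently wrong

# Naive route: logits -> sigmoid -> log.
# sigmoid(40) rounds to exactly 1.0 in float64, so log(1 - p) = log(0) = -inf.
p = 1.0 / (1.0 + np.exp(-z))
naive = -(y * np.log(p) + (1 - y) * np.log(1 - p))

# Stable route: softplus(z) - y*z, with softplus computed via logaddexp.
stable = np.logaddexp(0.0, z) - y * z

print(naive)   # first entry is inf
print(stable)  # both entries are approximately 40
```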
2) Intuition (plots)#
For binary classification:
if the true label is 1, the loss is \(-\log(p)\)
if the true label is 0, the loss is \(-\log(1-p)\)
So being confidently wrong is punished heavily (the loss goes to \(+\infty\) as the predicted probability goes to 0 for the true class).
A key property: log loss is a strictly proper scoring rule. If the true label is Bernoulli with positive rate \(q\), then the expected loss of predicting \(p\) is:
\[\mathbb{E}[\ell] = -q\log p - (1-q)\log(1-p)\]
This is the cross-entropy \(H(q,p)\) and it is minimized at \(p=q\).
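Before plotting, a quick numeric check of this claim (q = 0.7 is an arbitrary choice):

```python
import numpy as np

q = 0.7                             # true positive rate
p = np.linspace(0.01, 0.99, 9801)   # candidate predictions
expected = -(q * np.log(p) + (1 - q) * np.log(1 - p))

# The expected loss is minimized by predicting the true rate itself
print(p[np.argmin(expected)])  # approximately 0.7
```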
eps = 1e-6
p = np.linspace(eps, 1 - eps, 800)
loss_y1 = -np.log(p)
loss_y0 = -np.log(1 - p)
q = 0.7
expected_loss = -(q * np.log(p) + (1 - q) * np.log(1 - p))
fig = make_subplots(
rows=1,
cols=2,
subplot_titles=(
'Per-sample log loss as a function of predicted probability',
'Expected log loss when the true positive rate is q',
),
)
fig.add_trace(go.Scatter(x=p, y=loss_y1, name='y=1: -log(p)'), row=1, col=1)
fig.add_trace(go.Scatter(x=p, y=loss_y0, name='y=0: -log(1-p)'), row=1, col=1)
fig.update_xaxes(title_text='predicted probability p', row=1, col=1)
fig.update_yaxes(title_text='loss', row=1, col=1)
fig.add_trace(go.Scatter(x=p, y=expected_loss, name='E[loss]'), row=1, col=2)
fig.add_vline(x=q, line_width=2, line_dash='dash', line_color='black', row=1, col=2)
fig.add_annotation(
x=q,
y=float(expected_loss[np.argmin(np.abs(p - q))]),
text='minimum at p=q',
showarrow=True,
arrowhead=2,
ax=40,
ay=-30,
row=1,
col=2,
)
fig.update_xaxes(title_text='predicted probability p', row=1, col=2)
fig.update_yaxes(title_text='expected loss', row=1, col=2)
fig.update_layout(height=420, legend=dict(orientation='h', yanchor='bottom', y=1.02))
fig.show()
3) NumPy implementation: quick sanity checks#
A small example showing why log loss is sensitive to a single confident mistake.
y_true = np.array([1, 1, 1, 0, 0, 0])
p_good = np.array([0.9, 0.8, 0.7, 0.3, 0.2, 0.1])
p_one_confident_mistake = np.array([0.9, 0.8, 0.01, 0.3, 0.2, 0.99])
print('mean log loss (good):', log_loss_binary(y_true, p_good))
print('mean log loss (one confident mistake):', log_loss_binary(y_true, p_one_confident_mistake))
print('sklearn check:', sk_log_loss(y_true, p_good))
eps = 1e-15
losses_good = -(y_true * np.log(np.clip(p_good, eps, 1 - eps)) + (1 - y_true) * np.log(1 - np.clip(p_good, eps, 1 - eps)))
losses_bad = -(y_true * np.log(np.clip(p_one_confident_mistake, eps, 1 - eps)) + (1 - y_true) * np.log(1 - np.clip(p_one_confident_mistake, eps, 1 - eps)))
fig = go.Figure()
fig.add_trace(go.Bar(x=np.arange(len(y_true)), y=losses_good, name='per-sample loss (good)'))
fig.add_trace(go.Bar(x=np.arange(len(y_true)), y=losses_bad, name='per-sample loss (one confident mistake)'))
fig.update_layout(
barmode='group',
title='A single confident mistake can dominate mean log loss',
xaxis_title='sample index',
yaxis_title='per-sample loss',
)
fig.show()
p_baseline = np.full_like(y_true, y_true.mean(), dtype=float)
print('baseline (predict base rate p=mean(y)):', log_loss_binary(y_true, p_baseline))
mean log loss (good): 0.22839300363692283
mean log loss (one confident mistake): 1.6864438223668599
sklearn check: 0.22839300363692283
baseline (predict base rate p=mean(y)): 0.6931471805599453
Multiclass example#
For multiclass problems you pass a probability matrix of shape (n_samples, n_classes).
The loss for each sample is simply -log(probability_assigned_to_the_true_class).
y_true_mc = np.array([0, 2, 1, 2])
P = np.array(
[
[0.7, 0.2, 0.1],
[0.1, 0.2, 0.7],
[0.2, 0.6, 0.2],
[0.05, 0.05, 0.9],
]
)
print('multiclass log loss (numpy):', log_loss_multiclass(y_true_mc, P))
print('multiclass log loss (sklearn):', sk_log_loss(y_true_mc, P, labels=[0, 1, 2]))
# log-loss from logits should match log-loss from probabilities
Z = np.log(P)
print('multiclass log loss from logits (numpy):', log_loss_multiclass_from_logits(y_true_mc, Z))
multiclass log loss (numpy): 0.33238400682532043
multiclass log loss (sklearn): 0.3323840068253205
multiclass log loss from logits (numpy): 0.33238400682532054
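A related property worth knowing: shifting every logit in a row by the same constant leaves the softmax probabilities, and hence the loss, unchanged; this is exactly why subtracting the row max inside logsumexp is safe. A self-contained sketch with made-up logits:

```python
import numpy as np

def row_log_softmax(z):
    # log softmax per row, stabilized by subtracting the row max
    z = np.asarray(z, dtype=float)
    m = z.max(axis=1, keepdims=True)
    return z - (np.log(np.sum(np.exp(z - m), axis=1, keepdims=True)) + m)

Z = np.array([[2.0, 0.5, -1.0],
              [0.1, 0.3, 1.2]])
y = np.array([0, 2])

loss_plain = -row_log_softmax(Z)[np.arange(2), y].mean()
loss_shifted = -row_log_softmax(Z + 100.0)[np.arange(2), y].mean()  # constant shift

print(loss_plain, loss_shifted)  # identical
```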
4) Using log loss to optimize logistic regression (NumPy)#
Binary logistic regression models:
\[p_i = \sigma(z_i), \qquad z_i = w^\top x_i + b, \qquad \sigma(z) = \frac{1}{1+e^{-z}}\]
and minimizes the average log loss:
\[\mathcal{L}(w,b) = -\frac{1}{n}\sum_{i=1}^{n}\bigl[y_i\log p_i + (1-y_i)\log(1-p_i)\bigr]\]
A very useful fact for optimization is that the derivative w.r.t. the logit is:
\[\frac{\partial \ell_i}{\partial z_i} = p_i - y_i\]
So the gradients are:
\[\nabla_w \mathcal{L} = \frac{1}{n}X^\top(p - y), \qquad \frac{\partial \mathcal{L}}{\partial b} = \frac{1}{n}\sum_{i=1}^{n}(p_i - y_i)\]
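These formulas are easy to verify with finite differences; here is a quick self-contained check on random data (the bias term is omitted for brevity, and h = 1e-6 is an arbitrary step size):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = (rng.random(50) < 0.5).astype(float)
w = rng.normal(size=3)

def loss(w):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Analytic gradient from d(loss)/dz = p - y
p = 1.0 / (1.0 + np.exp(-(X @ w)))
grad = X.T @ (p - y) / len(y)

# Central finite differences, one coordinate at a time
h = 1e-6
grad_fd = np.array([(loss(w + h * e) - loss(w - h * e)) / (2 * h) for e in np.eye(3)])

print(np.max(np.abs(grad - grad_fd)))  # tiny: the analytic gradient matches
```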
Below is a simple gradient descent optimizer that learns w and b by directly minimizing log loss.
def standardize_fit_transform(X):
X = np.asarray(X, dtype=float)
mean = X.mean(axis=0)
std = X.std(axis=0)
std = np.where(std == 0, 1.0, std)
return (X - mean) / std, mean, std
def standardize_transform(X, mean, std):
X = np.asarray(X, dtype=float)
std = np.where(std == 0, 1.0, std)
return (X - mean) / std
def fit_logistic_regression_gd(
X_train,
y_train,
X_val=None,
y_val=None,
*,
lr=0.2,
n_steps=300,
l2=0.0,
):
X_train = np.asarray(X_train, dtype=float)
y_train = np.asarray(y_train)
n_samples, n_features = X_train.shape
w = np.zeros(n_features)
b = 0.0
history = {
'step': [],
'train_loss': [],
'train_acc': [],
'val_loss': [],
'val_acc': [],
}
for step in range(n_steps):
logits = X_train @ w + b
p = sigmoid(logits)
loss = log_loss_binary(y_train, p)
grad_w = (X_train.T @ (p - y_train)) / n_samples + l2 * w
grad_b = float(np.mean(p - y_train))
w -= lr * grad_w
b -= lr * grad_b
pred = (p >= 0.5).astype(int)
acc = float(np.mean(pred == y_train))
history['step'].append(step)
history['train_loss'].append(loss)
history['train_acc'].append(acc)
if X_val is not None and y_val is not None:
logits_val = X_val @ w + b
p_val = sigmoid(logits_val)
history['val_loss'].append(log_loss_binary(y_val, p_val))
history['val_acc'].append(float(np.mean((p_val >= 0.5).astype(int) == y_val)))
return w, b, history
X, y = make_blobs(
n_samples=800,
centers=2,
n_features=2,
cluster_std=2.2,
random_state=0,
)
X_train, X_val, y_train, y_val = train_test_split(
X,
y,
test_size=0.3,
random_state=0,
stratify=y,
)
X_train_s, mean, std = standardize_fit_transform(X_train)
X_val_s = standardize_transform(X_val, mean, std)
w, b, hist = fit_logistic_regression_gd(
X_train_s,
y_train,
X_val=X_val_s,
y_val=y_val,
lr=0.2,
n_steps=250,
)
fig = make_subplots(specs=[[{'secondary_y': True}]])
fig.add_trace(go.Scatter(x=hist['step'], y=hist['train_loss'], name='train log loss'), secondary_y=False)
fig.add_trace(go.Scatter(x=hist['step'], y=hist['val_loss'], name='val log loss'), secondary_y=False)
fig.add_trace(go.Scatter(x=hist['step'], y=hist['train_acc'], name='train accuracy'), secondary_y=True)
fig.add_trace(go.Scatter(x=hist['step'], y=hist['val_acc'], name='val accuracy'), secondary_y=True)
fig.update_xaxes(title_text='gradient descent step')
fig.update_yaxes(title_text='log loss (lower is better)', secondary_y=False)
fig.update_yaxes(title_text='accuracy', range=[0, 1], secondary_y=True)
fig.update_layout(title='Minimizing log loss trains logistic regression', height=420)
fig.show()
print('final train loss:', hist['train_loss'][-1])
print('final val loss:', hist['val_loss'][-1])
final train loss: 0.45572277119982485
final val loss: 0.4442069329282946
x0_min, x0_max = X_train_s[:, 0].min() - 0.8, X_train_s[:, 0].max() + 0.8
x1_min, x1_max = X_train_s[:, 1].min() - 0.8, X_train_s[:, 1].max() + 0.8
x0 = np.linspace(x0_min, x0_max, 220)
x1 = np.linspace(x1_min, x1_max, 220)
xx0, xx1 = np.meshgrid(x0, x1)
grid = np.c_[xx0.ravel(), xx1.ravel()]
prob_grid = sigmoid(grid @ w + b).reshape(xx0.shape)
fig = go.Figure()
fig.add_trace(
go.Contour(
x=x0,
y=x1,
z=prob_grid,
contours=dict(start=0.0, end=1.0, size=0.1),
colorscale='RdBu',
opacity=0.85,
colorbar=dict(title='P(y=1)'),
name='P(y=1)',
)
)
fig.add_trace(
go.Contour(
x=x0,
y=x1,
z=prob_grid,
contours=dict(start=0.5, end=0.5, size=0.5),
showscale=False,
line=dict(color='black', width=3),
hoverinfo='skip',
name='decision boundary (p=0.5)',
)
)
fig.add_trace(
go.Scatter(
x=X_train_s[:, 0],
y=X_train_s[:, 1],
mode='markers',
name='train',
marker=dict(
size=6,
color=y_train,
cmin=0,
cmax=1,
colorscale=[[0, '#1f77b4'], [1, '#d62728']],
line=dict(width=0.5, color='black'),
),
)
)
fig.add_trace(
go.Scatter(
x=X_val_s[:, 0],
y=X_val_s[:, 1],
mode='markers',
name='val',
marker=dict(
size=8,
symbol='x',
color=y_val,
cmin=0,
cmax=1,
colorscale=[[0, '#1f77b4'], [1, '#d62728']],
line=dict(width=1.0, color='black'),
),
)
)
fig.update_layout(
title='Decision boundary after minimizing log loss (standardized feature space)',
xaxis_title='feature 1 (standardized)',
yaxis_title='feature 2 (standardized)',
height=520,
)
fig.show()
5) Pros, cons, pitfalls#
Pros#
Uses probabilities: rewards calibrated predictions, not just correct hard labels.
Strictly proper scoring rule: in expectation, you minimize it by predicting the true conditional probabilities.
Differentiable: works naturally as a training objective (logistic regression, neural nets, softmax models).
Works for multiclass: via categorical cross-entropy.
Cons / caveats#
Harder to interpret than accuracy (units are nats if using natural logs).
Unbounded above: a few confidently wrong predictions can dominate the mean.
Sensitive to label noise: mislabeled points can produce very large losses if the model is confident.
Requires good probability estimates; models that only rank well (AUC) can still have poor log loss.
Common pitfalls#
Passing hard class labels instead of probabilities (log loss expects probabilities).
For multiclass, probability rows must align with label order and sum to 1.
With scikit-learn, if y_true contains only one class, pass labels=[...] to define the full label set.
Not clipping probabilities leads to log(0) and infinite loss; use a small \(\varepsilon\).
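A minimal illustration of the hard-labels pitfall (made-up numbers; one prediction falls on the wrong side of 0.5):

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 1])
p = np.array([0.9, 0.2, 0.8, 0.4])    # probabilities: what log_loss expects
hard = (p >= 0.5).astype(float)       # thresholded labels: a common mistake

print(log_loss(y_true, p))     # moderate
print(log_loss(y_true, hard))  # huge: the 0/1 "probabilities" get clipped, and
                               # the single mistake contributes roughly -log(eps)
```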
Where it is a good fit#
When you care about probability quality: risk estimation, triage systems, cost-sensitive decisions.
When you want an evaluation metric that matches the training objective for probabilistic classifiers.
When comparing calibrated models (often alongside calibration curves / Brier score).
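For the last point, a small sketch comparing the two metrics on made-up predictions (brier_score_loss is bounded in [0, 1], while log loss is unbounded):

```python
import numpy as np
from sklearn.metrics import log_loss, brier_score_loss

y = np.array([1, 1, 0, 0, 1, 0])
p_cautious = np.array([0.7, 0.6, 0.4, 0.3, 0.6, 0.4])
p_overconfident = np.array([0.99, 0.99, 0.01, 0.01, 0.01, 0.99])  # two confident mistakes

for name, p in [('cautious', p_cautious), ('overconfident', p_overconfident)]:
    print(name, 'log loss:', log_loss(y, p), 'Brier:', brier_score_loss(y, p))
```

Both metrics prefer the cautious model here, but log loss penalizes the confident mistakes far more heavily.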
Exercises#
Derive \(\partial \ell / \partial z = p - y\) for the binary case.
Implement multiclass gradient descent for softmax regression using log_loss_multiclass_from_logits.
Compare log loss and accuracy on an imbalanced dataset; notice that accuracy can look good even with poor probabilities.
References#
scikit-learn, sklearn.metrics.log_loss: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html
Cross-entropy and negative log-likelihood (NLL): https://en.wikipedia.org/wiki/Cross_entropy
Proper scoring rules: https://en.wikipedia.org/wiki/Scoring_rule